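Unlike ordinary computer vision tasks, which focus mainly on the semantic content of images, image manipulation detection pays more attention to the subtle traces left by manipulation. In this paper, noise images extracted by an improved constrained convolution are used as the model input instead of the original images, in order to obtain more subtle manipulation traces. Meanwhile, a dual-branch network consisting of a high-resolution branch and a context branch is used to capture as many artifact traces as possible. In general, most manipulations leave artifacts on the manipulation edges. A specially designed manipulation edge detection module is built on the dual-branch network to better identify these artifacts. The correlation between pixels in an image is closely related to their distance: the farther apart two pixels are, the weaker their correlation. We add a distance factor to the self-attention module to better describe the correlation between pixels. Experimental results on four public image manipulation datasets demonstrate the effectiveness of our model.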
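As a rough illustration of the distance factor described above, the sketch below subtracts a penalty proportional to the Euclidean distance between two pixels from their attention logit, so that far-apart pixels receive less weight. This is only one possible form: the linear penalty, the `decay` parameter, and the function name are assumptions made for illustration, and the abstract does not specify how the distance factor actually enters the attention computation. Learned query, key, and value projections are omitted for brevity.

```python
import math

def distance_aware_attention(features, positions, decay=0.5):
    """Toy single-head self-attention with a distance penalty on the logits.

    features:  list of equal-length feature vectors, one per pixel
    positions: list of (row, col) coordinates, one per pixel
    decay:     hypothetical strength of the distance penalty (an assumption)
    Returns (outputs, weights), where weights[i][j] is how much pixel i
    attends to pixel j.
    """
    dim = len(features[0])
    outputs, weight_rows = [], []
    for query, (qr, qc) in zip(features, positions):
        logits = []
        for key, (kr, kc) in zip(features, positions):
            # Scaled dot-product similarity between the two pixels.
            similarity = sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            # Farther pixels get a larger penalty, hence a smaller weight.
            distance = math.hypot(qr - kr, qc - kc)
            logits.append(similarity - decay * distance)
        # Numerically stable softmax over the penalised logits.
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        weight_rows.append(weights)
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights, features)) for d in range(dim)]
        )
    return outputs, weight_rows
```

With identical features at every pixel, the weights depend only on distance, and setting `decay=0` recovers plain softmax attention with uniform weights.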
translated by 谷歌翻译
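The base learners and labeled samples (shots) in an ensemble few-shot classifier greatly affect model performance. When performance is unsatisfactory, it is usually difficult to understand the underlying causes and make improvements. To tackle this issue, we propose a visual analysis method, FSLDiagnotor. Given a set of base learners and a collection of samples with a few shots, we consider two problems: 1) finding good base learners that predict the sample collection well; and 2) replacing low-quality shots with more representative ones so that the sample collection is adequately represented. We formulate both problems as sparse subset selection and develop two selection algorithms to recommend appropriate learners and shots, respectively. A matrix visualization and a scatterplot are combined to explain the recommended learners and shots in context and to help users adjust them. Based on the adjustments, the algorithm updates the recommendations for another round of improvement. Two case studies demonstrate that FSLDiagnotor helps build a few-shot classifier efficiently and increases accuracy by 12% and 21%, respectively.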
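In this paper, we briefly introduce the experimental methods and results of WHU-NERCMS in TRECVID 2021. This year we participated in the automatic and interactive tasks of Instance Search (INS). For the automatic task, the retrieval target is divided into two parts: person retrieval and action retrieval. We adopt a two-stage method comprising face detection and face recognition for person retrieval, and two kinds of action detection methods for action retrieval, consisting of three frame-based human-object interaction detection methods and two video-based general action detection methods. The person retrieval results and action retrieval results are then fused to initialize the result ranking lists. In addition, we attempt to use complementary methods to further improve search performance. For the interactive task, we test two different interaction strategies on the fusion results. We submit 4 runs each for the automatic and interactive tasks. An introduction to each run is shown in Table 1. The official evaluations show that the proposed strategies rank first in both the automatic and interactive tracks.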
Cross-speaker style transfer in speech synthesis aims at transferring a style from a source speaker to synthesised speech with a target speaker's timbre. Most previous approaches rely on data with style labels, but manually annotated labels are expensive and not always reliable. In response to this problem, we propose Style-Label-Free, a cross-speaker style transfer method that can transfer style from a source speaker to a target speaker without style labels. Firstly, a reference encoder structure based on a quantized variational autoencoder (Q-VAE) and a style bottleneck is designed to extract discrete style representations. Secondly, a speaker-wise batch normalization layer is proposed to reduce source speaker leakage. In order to improve the style extraction ability of the reference encoder, a style-invariant and contrastive data augmentation method is proposed. Experimental results show that the method outperforms the baseline. We provide a website with audio samples.
Image super-resolution (SR) is a technique to recover lost high-frequency information in low-resolution (LR) images. Spatial-domain information has been widely exploited to implement image SR, so a new trend is to involve frequency-domain information in SR tasks. Besides, image SR is typically application-oriented, and various computer vision tasks call for arbitrary image magnification. Therefore, in this paper, we study image features in the frequency domain to design a novel scale-arbitrary image SR network. First, we statistically analyze LR-HR image pairs from several datasets under different scale factors and find that the high-frequency spectra of different images under different scale factors suffer from different degrees of degradation, but the valid low-frequency spectra tend to be retained within a certain distribution range. Then, based on this finding, we devise an adaptive scale-aware feature division mechanism using deep reinforcement learning, which can accurately and adaptively divide the frequency spectrum into the low-frequency part to be retained and the high-frequency part to be recovered. Finally, we design a scale-aware feature recovery module to capture and fuse multi-level features for reconstructing the high-frequency spectrum at arbitrary scale factors. Extensive experiments on public datasets show the superiority of our method compared with state-of-the-art methods.
The ability to quickly and accurately identify covariate shift at test time is a critical and often overlooked component of safe machine learning systems deployed in high-risk domains. While methods exist for detecting when predictions should not be made on out-of-distribution test examples, identifying distributional level differences between training and test time can help determine when a model should be removed from the deployment setting and retrained. In this work, we define harmful covariate shift (HCS) as a change in distribution that may weaken the generalization of a predictive model. To detect HCS, we use the discordance between an ensemble of classifiers trained to agree on training data and disagree on test data. We derive a loss function for training this ensemble and show that the disagreement rate and entropy represent powerful discriminative statistics for HCS. Empirically, we demonstrate the ability of our method to detect harmful covariate shift with statistical certainty on a variety of high-dimensional datasets. Across numerous domains and modalities, we show state-of-the-art performance compared to existing methods, particularly when the number of observed test samples is small.
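The two statistics named above, disagreement rate and entropy, can be illustrated with a small sketch. The definitions used here are assumptions chosen for clarity: the disagreement rate is taken as the fraction of test samples on which at least one ensemble member differs from a base model, and the entropy as the mean entropy (in nats) of the members' vote distribution per sample. The abstract does not give the exact formulas, and the training loss for the ensemble is not reproduced.

```python
import math
from collections import Counter

def disagreement_rate(base_preds, member_preds):
    """Fraction of samples where any ensemble member differs from the base model.

    base_preds:   list of predicted labels from a base model
    member_preds: list of label lists, one per ensemble member
    """
    disagreements = 0
    for i, base_label in enumerate(base_preds):
        if any(member[i] != base_label for member in member_preds):
            disagreements += 1
    return disagreements / len(base_preds)

def mean_vote_entropy(member_preds):
    """Mean entropy (nats) of the per-sample vote distribution across members."""
    num_members = len(member_preds)
    num_samples = len(member_preds[0])
    total = 0.0
    for i in range(num_samples):
        votes = Counter(member[i] for member in member_preds)
        for count in votes.values():
            p = count / num_members
            total -= p * math.log(p)
    return total / num_samples
```

Both values would be near zero when the ensemble agrees with the base model, and larger values on a test batch could then be compared against values measured on held-out in-distribution data.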
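Most existing methods in vision-language retrieval match the two modalities in one of three ways: comparing their global feature vectors, which misses sufficient information and lacks interpretability; detecting objects in images or videos and aligning the text with fine-grained features, which relies on complicated model designs; or modeling fine-grained interaction via cross-attention over visual and textual tokens, which suffers from inferior efficiency. To address these limitations, some recent works simply aggregate token-wise similarities to achieve fine-grained alignment, but they lack intuitive explanations and neglect the relationships between token-level features and global representations with high-level semantics. In this work, we rethink fine-grained cross-modal alignment and devise a new model-agnostic formulation for it. We also demystify recent popular works and subsume them into our scheme. Furthermore, inspired by optimal transport theory, we introduce TokenFlow, an instantiation of the proposed scheme. By modifying only the similarity function, our method achieves performance comparable to SoTA algorithms with heavy model designs on major video-text retrieval benchmarks. Visualization further indicates that TokenFlow successfully leverages fine-grained information and achieves better interpretability.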
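To make the idea of aggregating token-wise similarities concrete, the sketch below scores a text against a video by taking, for each token, its best cosine match on the other side and averaging the results in both directions. This is a generic max-mean aggregation of the kind used by the simple baselines mentioned above, not the TokenFlow similarity itself, which the abstract does not spell out. The function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def max_mean_similarity(text_tokens, visual_tokens):
    """Aggregate token-wise similarities into one text-video score.

    Each text token is matched to its most similar visual token and the
    matches are averaged; the same is done from the visual side, and the
    two directions are averaged.
    """
    text_to_visual = sum(
        max(cosine(t, v) for v in visual_tokens) for t in text_tokens
    ) / len(text_tokens)
    visual_to_text = sum(
        max(cosine(v, t) for t in text_tokens) for v in visual_tokens
    ) / len(visual_tokens)
    return 0.5 * (text_to_visual + visual_to_text)
```

Because only the similarity function is involved, a score like this can replace a global-vector dot product without changing the encoders.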
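Dense passage retrieval aims to retrieve the passages relevant to a query from a large corpus, based on dense representations (i.e., vectors) of the query and the passages. Recent studies have explored improving pre-trained language models to boost dense retrieval performance. This paper proposes CoT-MAE (ConTextual Masked Auto-Encoder), a simple yet effective generative pre-training method for dense passage retrieval. CoT-MAE employs an asymmetric encoder-decoder architecture that learns to compress sentence semantics into a dense vector through self-supervised and context-supervised masked auto-encoding. Specifically, self-supervised masked auto-encoding learns to model the semantics of the tokens inside a text span, and context-supervised masked auto-encoding learns to model the semantic correlation between text spans. We conduct experiments on large-scale passage retrieval benchmarks and show considerable improvements over strong baselines, demonstrating the effectiveness of CoT-MAE.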
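The retrieval step that such pre-training targets can be sketched as follows: given a query vector and precomputed passage vectors, passages are ranked by inner product. The vectors here are assumed to come from some encoder that is not shown, the function name is hypothetical, and the exhaustive scan stands in for the approximate nearest-neighbour index a real system would likely use.

```python
def retrieve_top_k(query_vec, passage_vecs, k=2):
    """Rank passages by inner product with the query and return the top k.

    query_vec:    dense query representation (list of floats)
    passage_vecs: list of dense passage representations
    Returns a list of (passage_index, score) pairs, best first.
    """
    scored = [
        (index, sum(q * p for q, p in zip(query_vec, passage)))
        for index, passage in enumerate(passage_vecs)
    ]
    # Sort by score, highest first; ties keep the lower index first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Pre-training methods of this kind leave the ranking step unchanged and only aim to produce better vectors for it.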
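Convolutional neural networks (CNNs) and Transformers have achieved great success in multimedia applications. However, little effort has been made to harmonize these two architectures effectively and efficiently for image deraining. This paper aims to unify the two architectures and take advantage of their respective learning merits for image deraining. In particular, the local connectivity and translation equivariance of CNNs and the global aggregation ability of self-attention (SA) in Transformers are fully exploited for specific local context and global structure representations. Based on the observation that the rain distribution reveals the degradation location and degree, we introduce a degradation prior to help background recovery and accordingly present an association refinement deraining scheme. A novel multi-input attention module (MAM) is proposed to associate rain removal and background recovery. Moreover, we equip our model with effective depth-wise separable convolutions to learn specific feature representations and trade off computational complexity. Extensive experiments show that our proposed method (dubbed ELF) outperforms the state-of-the-art approach (MPRNet) by 0.25 dB on average, while accounting for only 11.7% and 42.1% of its computational cost and parameters, respectively. The source code is available at https://github.com/kuijiang94/magic-elf.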
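The complexity trade-off offered by depth-wise separable convolutions can be seen from a simple parameter count. The sketch below compares a standard convolution with a depth-wise convolution followed by a 1x1 point-wise convolution. The layer sizes are illustrative assumptions, not values from the paper, and bias terms are ignored.

```python
def standard_conv_params(kernel, in_channels, out_channels):
    """Weights in a standard convolution: one k x k filter per (in, out) pair."""
    return kernel * kernel * in_channels * out_channels

def separable_conv_params(kernel, in_channels, out_channels):
    """Weights in a depth-wise separable convolution.

    Depth-wise step: one k x k filter per input channel.
    Point-wise step: a 1 x 1 convolution mixing channels.
    """
    depthwise = kernel * kernel * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise

# Illustrative layer: 3x3 kernel, 64 input channels, 128 output channels.
standard = standard_conv_params(3, 64, 128)    # 73728
separable = separable_conv_params(3, 64, 128)  # 8768
ratio = separable / standard                   # roughly 0.12
```

For this example layer the separable form needs about 12% of the weights of the standard convolution, which is the kind of saving that motivates its use in lightweight models.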
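Video text spotting (VTS) is a task that requires simultaneously detecting, tracking, and recognizing text in video. Existing video text spotting methods typically develop sophisticated pipelines and multiple models, which is not friendly to real-time applications. Here we propose a real-time end-to-end video text spotter with Contrastive Representation learning (CoText). Our contributions are three-fold: 1) CoText simultaneously addresses the three tasks (text detection, tracking, and recognition) in a real-time, end-to-end trainable framework. 2) With contrastive learning, CoText models long-range dependencies and learns temporal information across multiple frames. 3) A simple, lightweight architecture is designed for effective and accurate performance, including GPU-parallel detection post-processing and a CTC-based recognition head with Masked RoI. Extensive experiments show the superiority of our method. In particular, CoText achieves a video text spotting IDF1 of 72.0% at 41.0 FPS on ICDAR2015video, an improvement of 10.5% and 32.0 FPS over the previous best method. The code can be found at github.com/weijiawu/cotext.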